Skip to content

Hive 29574 merge join poc#6456

Open
illiabarbashov-sketch wants to merge 4 commits into
apache:masterfrom
illiabarbashov-sketch:HIVE-29574_merge_join_poc
Open

Hive 29574 merge join poc#6456
illiabarbashov-sketch wants to merge 4 commits into
apache:masterfrom
illiabarbashov-sketch:HIVE-29574_merge_join_poc

Conversation

@illiabarbashov-sketch
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

  1. New Hive configurations: hive.merge.join.skew.threshold and hive.merge.join.skew.abort
  2. A new SkewedMergeJoinMonitor class to manage those configurations and log skew join event or abort it.
  3. Unit tests to cover positive and negative cases
  4. Query tests to cover clientpositive and clientnegative test cases

Why are the changes needed?

  1. This feature adds observability to merge join operator and flags the skewed keys
  2. Clients have problem with Skewed Merge Join when the job is stuck and there are no progress and no information about the reasons behind this issue. This feature adds a configuration to abort the stuck job if the threshold is hit.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Unit tests
  • Query tests with clientpositive and clientnegative cases

@sonarqubecloud
Copy link
Copy Markdown

Copy link
Copy Markdown
Contributor

@abstractdog abstractdog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice patch @illiabarbashov-sketch so far, let some comments

skewedKeyFlagged = new boolean[maxAlias];
}

public boolean isActive() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be private, or package-protected @VisibleForTesting if unit tests need it

SET hive.vectorized.execution.enabled=false;
set hive.mapred.mode=nonstrict;
set hive.explain.user=false;
set hive.cbo.enable=false;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cbo should be enabled, we can simply delete this

SET hive.vectorized.execution.enabled=false;
set hive.mapred.mode=nonstrict;
set hive.explain.user=false;
set hive.cbo.enable=false;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cbo should be enabled, we can simply delete this

@@ -0,0 +1,35 @@
SET hive.vectorized.execution.enabled=false;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vectorization should be enabled, we can simple delete this

@@ -0,0 +1,20 @@
SET hive.vectorized.execution.enabled=false;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vectorization should be enabled, we can simple delete this

set hive.explain.user=false;
set hive.cbo.enable=true;
set hive.auto.convert.join=false;
set hive.optimize.ppd=false;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is hive.optimize.ppd=false needed?

@@ -0,0 +1,35 @@
SET hive.vectorized.execution.enabled=true;
set hive.mapred.mode=nonstrict;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is hive.mapred.mode=nonstrict needed?

@@ -0,0 +1,20 @@
SET hive.vectorized.execution.enabled=true;
set hive.mapred.mode=nonstrict;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is hive.mapred.mode=nonstrict needed?

HIVE_JOIN_CACHE_SIZE("hive.join.cache.size", 25000,
"How many rows in the joining tables (except the streaming table) should be cached in memory."),
HIVE_MERGE_JOIN_SKEW_THRESHOLD("hive.merge.join.skew.threshold", -1L,
"Maximum number of rows allowed per join key in a single Tez sort-merge join task before a "
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can remove "Tez"
even "Tez" is the only execution engine we're currently supporting, this feature is theoretically orthogonal to that, so this is rather "Maximum number of rows allowed per join key in a single sort-merge reducer join task before..."

"Maximum number of rows allowed per join key in a single Tez sort-merge join task before a "
+ "skew event is reported."),
HIVE_MERGE_JOIN_SKEW_ABORT("hive.merge.join.skew.abort", false,
"When set to true and the row count is equal to hive.merge.join.skew.threshold, the Tez task will be aborted."),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe remove "Tez" from here too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants